[57] ABSTRACT

Serial analysis of gene expression, SAGE, a method for the 
rapid quantitative and qualitative analysis of transcripts is 
provided. Short defined sequence tags corresponding to 
expressed genes are isolated and analyzed. Sequencing of 
over 1,000 defined tags in a short period of time (e.g., hours)
reveals a gene expression pattern characteristic of the func- 
tion of a cell or tissue. Moreover, SAGE is useful as a gene 
discovery tool for the identification and isolation of novel 
sequence tags corresponding to novel transcripts and genes. 

43 Claims, 4 Drawing Sheets

METHOD FOR SERIAL ANALYSIS OF GENE
EXPRESSION

This invention was made with support from National 
Institutes of Health Grant Nos. CA57345, CA35494, and
GM07309. The Government has certain rights in this inven- 
tion. 

FIELD OF THE INVENTION 

The present invention relates generally to the field of gene 
expression and specifically to a method for the serial analy- 
sis of gene expression (SAGE) for the analysis of a large 
number of transcripts by identification of a defined region of 
a transcript which corresponds to a region of an expressed 
gene. 

BACKGROUND OF THE INVENTION 

Determination of the genomic sequence of higher 
organisms, including humans, is now a real and attainable 
goal. However, this analysis only represents one level of 
genetic complexity. The ordered and timely expression of 
genes represents another level of complexity equally impor- 
tant to the definition and biology of the organism. 

The role of sequencing complementary DNA (cDNA), 
reverse transcribed from mRNA, as part of the human 
genome project has been debated as proponents of genomic 
sequencing have argued the difficulty of finding every 
mRNA expressed in all tissues, cell types, and developmen- 
tal stages and have pointed out that much valuable infor- 
mation from intronic and intergenic regions, including con- 
trol and regulatory sequences, will be missed by cDNA 
sequencing (Report of the Committee on Mapping and 
Sequencing the Human Genome, National Academy Press, 
Washington, D.C., 1988). Sequencing of transcribed regions 
of the genome using cDNA libraries has heretofore been 
considered unsatisfactory. Libraries of cDNA are believed to
be dominated by repetitive elements, mitochondrial genes,
ribosomal RNA genes, and other nuclear genes comprising 
common or housekeeping sequences. It is believed that 
cDNA libraries do not provide all sequences corresponding 
to structural and regulatory polypeptides or peptides 
(Putney, et al., Nature, 302:718, 1983).

Another drawback of standard cDNA cloning is that some 
mRNAs are abundant while others are rare. The cellular 
quantities of mRNA from various genes can vary by several 
orders of magnitude. 

Techniques based on cDNA subtraction or differential
display can be quite useful for comparing gene expression
differences between two cell types (Hedrick, et al., Nature,
308:149, 1984; Liang and Pardee, Science, 257:967, 1992),
but provide only a partial analysis, with no direct information
regarding abundance of messenger RNA. The expressed
sequence tag (EST) approach has been shown to be a
valuable tool for gene discovery (Adams, et al., Science,
252:1656, 1991; Adams, et al., Nature, 355:632, 1992;
Okubo, et al., Nature Genetics, 2:173, 1992), but like
Northern blotting, RNase protection, and reverse
transcriptase-polymerase chain reaction (RT-PCR) analysis
(Alwine, et al., Proc. Natl. Acad. Sci. U.S.A., 74:5350, 1977;
Zinn, et al., Cell, 34:865, 1983; Veres, et al., Science,
237:415, 1987), only evaluates a limited number of genes at
a time. In addition, the EST approach preferably employs
nucleotide sequences of 150 base pairs or longer for similarity
searches and mapping.

Sequence tagged sites (STSs) (Olson, et al., Science,
245:1434, 1989) have also been utilized to identify genomic
markers for the physical mapping of the genome. These
short sequences from physically mapped clones represent
uniquely identified map positions in the genome. In contrast,
the identification of expressed genes relies on expressed
sequence tags which are markers for those genes actually
transcribed and expressed in vivo.

There is a need for an improved method which allows
rapid, detailed analysis of thousands of expressed genes for
the investigation of a variety of biological applications,
particularly for establishing the overall pattern of gene
expression in different cell types or in the same cell type
under different physiologic or pathologic conditions.
Identification of different patterns of expression has several
utilities, including the identification of appropriate therapeutic
targets, candidate genes for gene therapy (e.g., gene
replacement), tissue typing, forensic identification, mapping
locations of disease-associated genes, and for the identification
of diagnostic and prognostic indicator genes.

SUMMARY OF THE INVENTION

The present invention provides a method for the rapid 
analysis of numerous transcripts in order to identify the 
overall pattern of gene expression in different cell types or 
in the same cell type under different physiologic, develop- 
mental or disease conditions. The method is based on the 
identification of a short nucleotide sequence tag at a defined 
position in a messenger RNA. The tag is used to identify the 
corresponding transcript and gene from which it was transcribed.
By utilizing dimerized tags, termed a "ditag", the
method of the invention allows elimination of certain types 
of bias which might occur during cloning and/or amplifica- 
tion and possibly during data evaluation. Concatenation of 
these short nucleotide sequence tags allows the efficient 
analysis of transcripts in a serial manner by sequencing 
multiple tags on a single DNA molecule, for example, a 
DNA molecule inserted in a vector or in a single clone. 

The method described herein is the serial analysis of gene
expression (SAGE), a novel approach which allows the
analysis of a large number of transcripts. To demonstrate this
strategy, short cDNA sequence tags were generated from
mRNA isolated from pancreas, randomly paired to form
ditags, concatenated, and cloned. Manual sequencing of
1,000 tags revealed a gene expression pattern characteristic
of pancreatic function. Identification of such patterns is
important diagnostically and therapeutically, for example.
Moreover, the use of SAGE as a gene discovery tool was
documented by the identification and isolation of new
pancreatic transcripts corresponding to novel tags. SAGE
provides a broadly applicable means for the quantitative
cataloging and comparison of expressed genes in a variety of
normal, developmental, and disease states.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a schematic of SAGE. The first restriction
enzyme, or anchoring enzyme, is NlaIII and the second
enzyme, or tagging enzyme, is FokI in this example.
Sequences represent primer derived sequences, and transcript
derived sequences with "X" and "O" representing
nucleotides of different tags.

FIG. 2 shows a comparison of transcript abundance. Bars
represent the percent abundance as determined by SAGE
(dark bars) or hybridization analysis (light bars). SAGE
quantitations were derived from Table 1 as follows: TRY 1/2
includes the tags for trypsinogen 1 and 2, PROCAR indicates
tags for procarboxypeptidase A1, CHYMO indicates
tags for chymotrypsinogen, and ELA/PRO includes the tags
for elastase IIIB and protease E. Error bars represent the
standard deviation determined by taking the square root of
counted events and converting it to a percent abundance
(assumed Poisson distribution).
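
The error-bar rule described for FIG. 2 can be illustrated with a short sketch. It assumes, as the text does, that tag counts follow a Poisson distribution, so that the standard deviation of a count is its square root. The function name and the example counts are hypothetical and are not taken from Table 1.

```python
import math


def percent_abundance_with_error(tag_count, total_tags):
    """Return (percent abundance, standard deviation in percent).

    Assumes Poisson counting statistics: the standard deviation of a
    count of n events is sqrt(n), which is then rescaled to a percentage
    of all tags counted.
    """
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    if tag_count < 0:
        raise ValueError("tag_count must be non-negative")
    percent = 100.0 * tag_count / total_tags
    sd_percent = 100.0 * math.sqrt(tag_count) / total_tags
    return percent, sd_percent


# Hypothetical example: 64 occurrences of one tag among 800 counted tags.
percent, sd = percent_abundance_with_error(64, 800)
print(f"{percent:.1f}% +/- {sd:.1f}%")  # prints "8.0% +/- 1.0%"
```

Because the error scales with the square root of the count, the relative uncertainty shrinks as more tags are sequenced, which is one reason abundant transcripts are quantified more precisely than rare ones.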

FIG. 3 shows the results of screening a cDNA library with
SAGE tags. P1 and P2 show typical hybridization results
obtained with 13 bp oligonucleotides as described in the
Examples. P1 and P2 correspond to the transcripts described
in Table 2. Images were obtained using a Molecular Dynamics
PhosphorImager and the circle indicates the outline of
the filter membrane to which the recombinant phage were
transferred prior to hybridization.

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

The present invention provides a rapid, quantitative pro- 
cess for determining the abundance and nature of transcripts 
corresponding to expressed genes. The method, termed 
serial analysis of gene expression (SAGE), is based on the 
identification of and characterization of partial, defined 
sequences of transcripts corresponding to gene segments. 
These defined transcript sequence "tags" are markers for
genes which are expressed in a cell, a tissue, or an extract,
for example. 

SAGE is based on several principles. First, a short nucleotide
sequence tag (9 to 10 bp) contains sufficient information
content to uniquely identify a transcript provided it is
isolated from a defined position within the transcript. For
example, a sequence as short as 9 bp can distinguish 262,144
transcripts (4^9) given a random nucleotide distribution
at the tag site, whereas estimates suggest that the human
genome encodes about 80,000 to 200,000 transcripts (Fields,
et al., Nature Genetics, 7:345, 1994). The size of the tag can
be shorter for lower eukaryotes or prokaryotes, for example,
where the number of transcripts encoded by the genome is
lower. For example, a tag as short as 6-7 bp may be
sufficient for distinguishing transcripts in yeast.
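
The information-content argument above is simple arithmetic and can be checked with a brief sketch. It assumes the same random nucleotide distribution as the text. The function names are hypothetical, and the figure of 6,000 transcripts used for a small genome is an assumed, illustrative value that does not come from this document.

```python
def distinct_tags(tag_length_bp):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** tag_length_bp


def min_tag_length(n_transcripts):
    """Smallest tag length whose sequence space is at least n_transcripts."""
    length = 1
    while distinct_tags(length) < n_transcripts:
        length += 1
    return length


print(distinct_tags(9))         # 262144, the figure quoted in the text
print(min_tag_length(200_000))  # 9
print(min_tag_length(6_000))    # 7 (6,000 is an assumed, illustrative count)
```

The result is only a lower bound: a sequence space barely larger than the number of transcripts would still produce many shared tags, so a practical tag length leaves some margin above this minimum.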

Second, random dimerization of tags allows a procedure 
for reducing bias (caused by amplification and/or cloning). 
Third, concatenation of these short sequence tags allows the 
efficient analysis of transcripts in a serial manner by 
sequencing multiple tags within a single vector or clone. As 
with serial communication by computers, wherein informa- 
tion is transmitted as a continuous string of data, serial 
analysis of the sequence tags requires a means to establish 
the register and boundaries of each tag. All of these prin- 
ciples may be applied independently, in combination, or in 
combination with other known methods of sequence iden- 
tification. 

In a first embodiment, the invention provides a method for
the detection of gene expression in a particular cell or tissue, 
or cell extract, for example, including at a particular devel- 
opmental stage or in a particular disease state. The method 
comprises producing complementary deoxyribonucleic acid 
(cDNA) oligonucleotides, isolating a first defined nucleotide 
sequence tag from a first cDNA oligonucleotide and a 
second defined nucleotide sequence tag from a second 
cDNA oligonucleotide, linking the first tag to a first oligo- 
nucleotide linker, wherein the first oligonucleotide linker 
comprises a first sequence for hybridization of an amplifi- 
cation primer and linking the second tag to a second oligo- 
nucleotide linker, wherein the second oligonucleotide linker 
comprises a second sequence for hybridization of an amplification
primer, and determining the nucleotide sequence of
the tag(s), wherein the tag(s) correspond to an expressed 
gene.

FIG. 1 shows a schematic representation of the analysis of
messenger RNA (mRNA) using SAGE as described in the
method of the invention. mRNA is isolated from a cell or
tissue of interest for in vitro synthesis of a double-stranded
DNA sequence by reverse transcription of the mRNA. The
double-stranded DNA complement of mRNA formed is
referred to as complementary DNA (cDNA).

The term "oligonucleotide" as used herein refers to primers
or oligomer fragments comprised of two or more
deoxyribonucleotides or ribonucleotides, preferably more than
three. The exact size will depend on many factors, which in
turn depend on the ultimate function or use of the
oligonucleotide.

The method further includes ligating the first tag linked to
the first oligonucleotide linker to the second tag linked to the
second oligonucleotide linker and forming a "ditag". Each
ditag represents two defined nucleotide sequences of at least
one transcript, representative of at least one gene. Typically,
a ditag represents two transcripts from two distinct genes.
The presence of a defined cDNA tag within the ditag is
indicative of expression of a gene having a sequence of that
tag.

The analysis of ditags, formed prior to any amplification
step, provides a means to eliminate potential distortions
introduced by amplification, e.g., PCR. The pairing of tags
for the formation of ditags is a random event. The number
of different tags is expected to be large; therefore, the
probability of any two tags being coupled in the same ditag
is small, even for abundant transcripts. Therefore, repeated
ditags potentially produced by biased standard amplification
and/or cloning methods are excluded from analysis by the
method of the invention.
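
The exclusion of repeated ditags can be sketched as follows. This is a minimal illustration rather than the software used in the Examples: it assumes each ditag is supplied as the sequence lying between two anchoring enzyme sites, that a ditag read from either strand is the same ditag, and that each tag is the fixed number of bases nearest each end. All function and parameter names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_tags_from_ditags(ditags, tag_length=9):
    """Tally tags, counting each distinct ditag only once.

    A ditag and its reverse complement are treated as the same molecule,
    so a repeat in either orientation is skipped.
    """
    seen = set()
    counts = Counter()
    for ditag in ditags:
        ditag = ditag.upper()
        if len(ditag) < 2 * tag_length:
            continue  # too short to hold two full tags
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue  # repeated ditag: possible amplification artifact
        seen.add(key)
        # Left tag reads 5'->3' from the left anchoring site; the right
        # tag is read from the opposite strand.
        counts[ditag[:tag_length]] += 1
        counts[reverse_complement(ditag[-tag_length:])] += 1
    return counts


# Toy example with 3 bp tags: the second and third ditags repeat the first.
example = ["AAACCC", "AAACCC", "GGGTTT", "AAATTT"]
print(count_tags_from_ditags(example, tag_length=3))
```

Counting each ditag once means that an abundant tag is still tallied many times, because it is expected to pair with many different partners, while a single molecule over-represented by amplification contributes only once.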

The term "defined" nucleotide sequence, or "defined"
nucleotide sequence tag, refers to a nucleotide sequence
derived from either the 5' or 3' terminus of a transcript. The
sequence is defined by cleavage with a first restriction
endonuclease, and represents nucleotides either 5' or 3' of
the first restriction endonuclease site, depending on which
terminus is used for capture (e.g., 3' when oligo-dT is used
for capture as described herein).

As used herein, the terms "restriction endonucleases" and
"restriction enzymes" refer to bacterial enzymes which bind
to a specific double-stranded DNA sequence termed a
recognition site or recognition nucleotide sequence, and cut
double-stranded DNA at or near the specific recognition site.

The first endonuclease, termed "anchoring enzyme" or
"AE" in FIG. 1, is selected by its ability to cleave a transcript
at least one time and therefore produce a defined sequence
tag from either the 5' or 3' end of a transcript. Preferably, a
restriction endonuclease having at least one recognition site
and therefore having the ability to cleave a majority of
cDNAs is utilized. For example, as illustrated herein,
enzymes which have a 4 base pair recognition site are
expected to cleave every 256 base pairs (4^4) on average
while most transcripts are considerably larger. Restriction
endonucleases which recognize a 4 base pair site include
NlaIII, as exemplified in the EXAMPLES of the present
invention. Other similar endonucleases having at least one
recognition site within a DNA molecule (e.g., cDNA) will be
known to those of skill in the art (see for example, Current
Protocols in Molecular Biology, Vol. 2, 1995, Ed. Ausubel,
et al., Greene Publish. Assoc. & Wiley Interscience, Unit
3.1.15; New England Biolabs Catalog, 1995).
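
The cutting-frequency estimate above can be illustrated with a short sketch. It assumes a random, equiprobable nucleotide sequence and treats each position as independent, which is only an approximation. The function names and the 1,500 bp transcript length are hypothetical choices made for illustration.

```python
def expected_site_spacing(site_length_bp):
    """Average spacing, in bp, between occurrences of a recognition site."""
    return 4 ** site_length_bp


def probability_cut_at_least_once(transcript_length_bp, site_length_bp):
    """Approximate chance that a random transcript contains the site.

    Each possible start position is treated as an independent trial.
    """
    positions = transcript_length_bp - site_length_bp + 1
    if positions <= 0:
        return 0.0
    p_site = 1.0 / expected_site_spacing(site_length_bp)
    return 1.0 - (1.0 - p_site) ** positions


print(expected_site_spacing(4))  # 256, as stated in the text
print(round(probability_cut_at_least_once(1500, 4), 3))
print(round(probability_cut_at_least_once(1500, 8), 3))
```

Under these assumptions a 4 bp cutter is expected to cleave nearly every transcript of this length, whereas an enzyme with an 8 bp recognition site would cleave only a small fraction of them.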

After cleavage with the anchoring enzyme, the most 5' or
3' region of the cleaved cDNA can then be isolated by
binding to a capture medium. For example, as illustrated in
the present EXAMPLES, streptavidin beads are used to
isolate the defined 3' nucleotide sequence tag when the oligo
dT primer for cDNA synthesis is biotinylated. In this
example, cleavage with the first or anchoring enzyme
provides a unique site on each transcript which corresponds to
the restriction site located closest to the poly-A tail.
Likewise, the 5' cap of a transcript (the cDNA) can be
utilized for labeling or binding a capture means for isolation
of a 5' defined nucleotide sequence tag. Those of skill in the
art will know other similar capture systems (e.g.,
biotin/streptavidin, digoxigenin/anti-digoxigenin) for isolation of
the defined sequence tag as described herein.

The invention is not limited to use of a single "anchoring"
or first restriction endonuclease. It may be desirable to
perform the method of the invention sequentially, using
different enzymes on separate samples of a preparation, in
order to identify a complete pattern of transcription for a cell
or tissue. In addition, the use of more than one anchoring
enzyme provides confirmation of the expression pattern
obtained from the first anchoring enzyme. Therefore, it is
also envisioned that the first or anchoring endonuclease may
rarely cut cDNA such that few or no cDNA representing
abundant transcripts are cleaved. Thus, transcripts which are
cleaved represent "unique" transcripts. Restriction enzymes
that have a 7-8 bp recognition site, for example, would be
enzymes that would rarely cut cDNA. Similarly, more than
one tagging enzyme, described below, can be utilized in
order to identify a complete pattern of transcription.

The term "isolated" as used herein includes polynucleotides
substantially free of other nucleic acids, proteins,
lipids, carbohydrates or other materials with which it is
naturally associated. cDNA is not naturally occurring as
such, but rather is obtained via manipulation of a partially
purified naturally occurring mRNA. Isolation of a defined
sequence tag refers to the purification of the 5' or 3' tag from
other cleaved cDNA.

In one embodiment, the isolated defined nucleotide
sequence tags are separated into two pools of cDNA, when
the linkers have different sequences. Each pool is ligated via
the anchoring, or first restriction endonuclease site to one of
two linkers. When the linkers have the same sequence, it is
not necessary to separate the tags into pools. The first
oligonucleotide linker comprises a first sequence for
hybridization of an amplification primer and the second
oligonucleotide linker comprises a second sequence for
hybridization of an amplification primer. In addition, the linkers
further comprise a second restriction endonuclease site, also
termed the "tagging enzyme" or "TE". The method of the
invention does not require, but preferably comprises
amplifying the ditag oligonucleotide after ligation.

The second restriction endonuclease cleaves at a site
distant from or outside of the recognition site. For example,
the second restriction endonuclease can be a type IIS
restriction enzyme. Type IIS restriction endonucleases cleave at a
defined distance up to 20 bp away from their asymmetric
recognition sites (Szybalski, W., Gene, 40:169, 1985).
Examples of type IIS restriction endonucleases include
BsmFI and FokI. Other similar enzymes will be known to
those of skill in the art (see, Current Protocols in Molecular
Biology, supra).

The first and second "linkers" which are ligated to the
defined nucleotide sequence tags are oligonucleotides having
the same or different nucleotide sequences. For example,
the linkers illustrated in the Examples of the present
invention include linkers having different sequences:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2) and

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4),

wherein A is a dideoxy nucleotide (e.g., dideoxy A). Other
similar linkers can be utilized in the method of the invention;
those of skill in the art can design such alternate linkers.

The linkers are designed so that cleavage of the ligation
products with the second restriction enzyme, or tagging
enzyme, results in release of the linker having a defined
nucleotide sequence tag (e.g., 3' of the restriction
endonuclease cleavage site as exemplified herein). The defined
nucleotide sequence tag may be from about 6 to 30 base
pairs. Preferably, the tag is about 9 to 11 base pairs.
Therefore, a ditag is from about 12 to 60 base pairs, and
preferably from 18 to 22 base pairs.

The pool of defined tags ligated to linkers having the same
sequence, or the two pools of defined nucleotide sequence
tags ligated to linkers having different nucleotide sequences,
are randomly ligated to each other "tail to tail". The portion
of the cDNA tag furthest from the linker is referred to as the
"tail". As illustrated in FIG. 1, the ligated tag pair, or ditag,
has a first restriction endonuclease site upstream (5') and a
first restriction endonuclease site downstream (3') of the
ditag; a second restriction endonuclease cleavage site
upstream and downstream of the ditag, and a linker
oligonucleotide containing both a second restriction enzyme
recognition site and an amplification primer hybridization
site upstream and downstream of the ditag. In other words,
the ditag is flanked by the first restriction endonuclease site,
the second restriction endonuclease cleavage site and the
linkers, respectively.

The ditag can be amplified by utilizing primers which
specifically hybridize to one strand of each linker.
Preferably, the amplification is performed by standard
polymerase chain reaction (PCR) methods as described (U.S. Pat.
No. 4,683,195). Alternatively, the ditags can be amplified by
cloning in prokaryotic-compatible vectors or by other
amplification methods known to those of skill in the art.

The term "primer" as used herein refers to an
oligonucleotide, whether occurring naturally or produced
synthetically, which is capable of acting as a point of
initiation of synthesis when placed under conditions in
which synthesis of primer extension product which is
complementary to a nucleic acid strand is induced, i.e., in the
presence of nucleotides and an agent for polymerization
such as DNA polymerase and at a suitable temperature and
pH. The primer is preferably single stranded for maximum
efficiency in amplification. Preferably, the primer is an
oligodeoxyribonucleotide. The primer must be sufficiently
long to prime the synthesis of extension products in the
presence of the agent for polymerization. The exact lengths
of the primers will depend on many factors, including
temperature and source of primer.

The primers herein are selected to be "substantially"
complementary to the different strands of each specific
sequence to be amplified. This means that the primers must
be sufficiently complementary to hybridize with their
respective strands. Therefore, the primer sequence need not
reflect the exact sequence of the template. In the present
invention, the primers are substantially complementary to
the oligonucleotide linkers.

Primers useful for amplification of the linkers exemplified
herein as SEQ ID NO:1-4 include
5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5) and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6). Those of skill in the art
can prepare similar primers for amplification based on the
nucleotide sequence of the linkers without undue
experimentation.

Cleavage of the amplified PCR product with the first
restriction endonuclease allows isolation of ditags which can
be concatenated by ligation. After ligation, it may be
desirable to clone the concatemers, although it is not required in
the method of the invention. Analysis of the ditags or
concatemers, whether or not amplification was performed, is
[illegible] be apparent that the number of ditags which can be
concatenated will depend on the length of the individual tags and
can be readily determined by those of skill in the art without
undue experimentation. After formation of concatemers,
multiple tags can be cloned into a vector for sequence
analysis, or alternatively, ditags or concatemers can be
directly sequenced without cloning by methods known to
those of skill in the art.

Among the standard procedures for cloning the defined
nucleotide sequence tags of the invention is insertion of the
tags into vectors such as plasmids or phage. The ditag or
concatemers of ditags produced by the method described
herein are cloned into recombinant vectors for further
analysis, e.g., sequence analysis, plaque/plasmid
hybridization using the tags as probes, by methods known to those of
skill in the art.

The term "recombinant vector" refers to a plasmid, virus
or other vehicle known in the art that has been manipulated
by insertion or incorporation of the ditag genetic sequences.
Such vectors contain a promoter sequence which facilitates
the efficient transcription of a marker genetic sequence,
for example. The vector typically contains an origin of
replication, a promoter, as well as specific genes which
allow phenotypic selection of the transformed cells. Vectors
suitable for use in the present invention include, for example,
pBlueScript (Stratagene, La Jolla, Calif.); pBC, pSL301
(Invitrogen) and other similar vectors known to those of skill
in the art. Preferably, the ditags or concatemers thereof are
ligated into a vector for sequencing purposes.

Vectors in which the ditags are cloned can be transferred
into a suitable host cell. "Host cells" are cells in which a
vector can be propagated and its DNA expressed. The term
also includes any progeny of the subject host cell. It is
understood that all progeny may not be identical to the
parental cell since there may be mutations that occur during
replication. However, such progeny are included when the
term "host cell" is used. Methods of stable transfer, meaning
that the foreign DNA is continuously maintained in the host,
are known in the art.

Transformation of a host cell with a vector containing
ditag(s) may be carried out by conventional techniques as
are well known to those skilled in the art. Where the host is
prokaryotic, such as E. coli, competent cells which are
capable of DNA uptake can be prepared from cells harvested
after exponential growth phase and subsequently treated by
the CaCl2 method using procedures well known in the art.
Alternatively, MgCl2 or RbCl can be used. Transformation
can also be performed by electroporation or other commonly
used methods in the art.

The ditags present in a particular clone can be sequenced
by standard methods (see for example, Current Protocols in
Molecular Biology, supra, Unit 7) either manually or using
automated methods.

In another embodiment, the present invention provides a
kit used for detection of gene expression wherein the
presence of a defined nucleotide tag or ditag is indicative of
expression of a gene having a sequence of the tag, the kit
comprising one or more containers comprising a first
container containing a first oligonucleotide linker having a first
sequence useful for hybridization of an amplification primer; a
second container containing a second oligonucleotide linker
having a second sequence useful for hybridization of an
amplification primer, wherein [illegible] comprise a restriction
endonuclease [illegible] cleavage of DNA at a [illegible] to the
first [illegible] sequence of the linker.
It is apparent that if the oligonucleotide linkers comprise the
same nucleotide sequence, only one container containing
linkers is necessary in the kit of the invention.

In yet another embodiment, the invention provides an
oligonucleotide composition having at least two defined
nucleotide sequence tags, wherein at least one of the
sequence tags corresponds to at least one expressed gene.
The composition consists of about 1 to 200 ditags, and
preferably [illegible] to 20 ditags. Such compositions are
[illegible] of gene expression by the
defined nucleotide sequence tag corresponding to an
expressed gene in a cell, tissue or cell extract, for example.

The following examples are intended to illustrate but not
limit the invention. While they are typical of those that might
be used, other procedures known to those skilled in the art
may alternatively be used.

For exemplary purposes, the SAGE method of the
invention was used to characterize gene expression in the human
pancreas. NlaIII was utilized as the first restriction
endonuclease, or anchoring enzyme, and BsmFI as the
second restriction endonuclease, or tagging enzyme,
yielding a 9 bp tag (BsmFI was predicted to cleave the
complementary strand 14 bp 3' to the recognition site GGGAC and
to yield a 4 bp 5' overhang (New England BioLabs)).
Overlapping the BsmFI and NlaIII (CATG) sites as indicated
(GGGACATG) would be predicted to result in an 11 bp tag.
However, analysis suggested that under the cleavage
conditions used (37° C.), BsmFI often cleaved closer to its
recognition site leaving a minimum of 12 bp 3' of its
recognition site. Therefore, only the 9 bp closest to the
anchoring enzyme site was used for analysis of tags.
Cleavage at 65° C. results in a more consistent 11 bp tag.

Computer analysis of human transcripts from GenBank
indicated that greater than 95% of tags of 9 bp in length were
likely to be unique and that inclusion of two additional bases
provided little additional resolution. Human sequences
(84,300) were extracted from the GenBank 87 database using
the Findseq program provided on the IntelliGenetics Bionet
on-line service. All further analysis was performed with a
SAGE program group written in Microsoft Visual Basic for
the Microsoft Windows operating system. The SAGE
database analysis program was set to include only sequences
noted as "RNA" in the locus description and to exclude
entries noted as "EST", resulting in a reduction to 13,241
sequences. Analysis of this subset of sequences using NlaIII
as anchoring enzyme indicated that 4,127 nine bp tags were
unique while 1,511 tags were found in more than one entry.
Nucleotide comparison of a randomly chosen subset (100)
of the latter entries indicated that at least 83% were due to
redundant data base entries for the same gene or highly
related genes (>95% identity over at least 250 bp). This
suggested that 5,381 of the 9 bp tags (95.5%) were unique to
a transcript or highly conserved transcript family. Likewise,
analysis of the same subset of GenBank with an 11 bp tag
resulted only in a 6% decrease in repeated tags (1,511 to
1,425) instead of the 94% decrease expected if the repeated
tags were due to unrelated transcripts.
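
The database analysis described above can be approximated by the following sketch, which is not the Visual Basic program referred to in the text. It assumes each entry is a plain transcript sequence, takes the tag as the bases immediately 3' of the anchoring enzyme site (CATG for NlaIII) closest to the 3' end, and then separates tags found in a single entry from tags shared by several entries. The function names and the toy entries are hypothetical.

```python
from collections import defaultdict


def three_prime_tag(transcript, anchor="CATG", tag_length=9):
    """Return the tag adjacent to the 3'-most anchoring site, or None."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)
    if pos == -1:
        return None  # no anchoring enzyme site in this entry
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    # Entries with too few bases after the site are skipped in this sketch.
    return tag if len(tag) == tag_length else None


def tag_uniqueness(entries, anchor="CATG", tag_length=9):
    """Split tags into those seen in one entry and those seen in several."""
    entries_by_tag = defaultdict(set)
    for name, seq in entries.items():
        tag = three_prime_tag(seq, anchor, tag_length)
        if tag is not None:
            entries_by_tag[tag].add(name)
    unique = {t for t, names in entries_by_tag.items() if len(names) == 1}
    shared = set(entries_by_tag) - unique
    return unique, shared


# Toy entries (hypothetical sequences, not GenBank records).
toy = {
    "a": "GGCATGAAAAAAAAACC",
    "b": "TTCATGAAAAAAAAAGG",
    "c": "CATGCCCCCCCCCTT",
    "d": "ACGT",
}
unique, shared = tag_uniqueness(toy)
print(sorted(unique), sorted(shared))
```

A shared tag does not by itself mean that two unrelated transcripts collide. As the comparison above indicates, most shared tags in the database came from redundant entries for the same or closely related genes, so a follow-up sequence comparison of the entries behind each shared tag is still needed.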

Example 1 

As outlined above, mRNA from human pancreas was
used to generate ditags. Briefly, five μg mRNA from total
pancreas (Clontech) was converted to double stranded
cDNA using a BRL cDNA synthesis kit following the
manufacturer's protocol, using the primer biotin-5'T18-3'.
The cDNA was then cleaved with NlaIII and the 3' restriction
fragments isolated by binding to magnetic streptavidin
beads (Dynal). The bound DNA was divided into two pools,
and one of the following linkers ligated to each pool:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:1 and 2)

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:3 and 4), where
A is a dideoxy nucleotide (e.g., dideoxy A).

After extensive washing to remove unligated linkers, the
linkers and adjacent tags were released by cleavage with
BsmFI. The resulting overhangs were filled in with T4
polymerase and the pools combined and ligated to each
other. The desired ligation product was then amplified for 25
cycles using 5'-CCAGCTTATTCAATTCGGTCC-3' and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:5 and
6, respectively) as primers. The PCR reaction was then
analyzed by polyacrylamide gel electrophoresis and the
desired product excised. An additional 15 cycles of PCR
were then performed to generate sufficient product for
efficient ligation and cloning.

The PCR ditag products were cleaved with NlaIII and the
band containing the ditags was excised and self-ligated.
After ligation, the concatenated ditags were separated by
polyacrylamide gel electrophoresis and products greater than
200 bp were excised. These products were cloned into the
SphI site of pSL301 (Invitrogen). Colonies were screened
for inserts by PCR using T7 and T3 sequences outside the
cloning site as primers. Clones containing at least 10 tags
(range 10 to 50 tags) were identified by PCR amplification
and manually sequenced as described (Del Sal, et al.,
Biotechniques 7:514, 1989) using
5'-GACGTCGACCTGAGGTAATTATAACC-3' (SEQ ID
NO:7) as primer. Sequence files were analyzed using the
SAGE software group which identifies the anchoring
enzyme site with the proper spacing and extracts the two
intervening tags and records them in a database. The 1,000
tags were derived from 413 unique ditags and 87 repeated
ditags. The latter were only counted once to eliminate
potential PCR bias of the quantitation. The function of 
SAGE software is merely to optimize the search for gene 
sequences. 
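
The tag-extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAGE software group itself: the function names, the 18 to 22 bp ditag window (borrowed from claim 15), and the fixed 9 bp tag length are choices made for the example.

```python
import re
from collections import Counter

ANCHOR = "CATG"   # NlaIII anchoring enzyme site
TAG_LEN = 9       # tag length analyzed in Example 1
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_ditags(read: str, min_len: int = 18, max_len: int = 22) -> list:
    """Return the stretches between consecutive anchoring sites that look like ditags."""
    sites = [m.start() for m in re.finditer(ANCHOR, read)]
    ditags = []
    for left, right in zip(sites, sites[1:]):
        body = read[left + len(ANCHOR):right]
        # Keep only properly spaced stretches without ambiguous base calls.
        if min_len <= len(body) <= max_len and set(body) <= set("ACGT"):
            ditags.append(body)
    return ditags


def count_tags(reads) -> Counter:
    """Count tags, using each distinct ditag once to limit PCR bias."""
    seen = set()
    counts = Counter()
    for read in reads:
        for ditag in extract_ditags(read.upper()):
            key = min(ditag, revcomp(ditag))   # same ditag read from either strand
            if key in seen:
                continue
            seen.add(key)
            counts[ditag[:TAG_LEN]] += 1            # tag next to the left anchor
            counts[revcomp(ditag)[:TAG_LEN]] += 1   # tag next to the right anchor
    return counts


# Toy concatemer containing the same ditag twice (a repeated ditag).
tag_a, tag_b = "GAGCACACC", "TCCTCAAAA"
ditag = tag_a + revcomp(tag_b)
read = ANCHOR + ditag + ANCHOR + ditag + ANCHOR
counts = count_tags([read])
print(counts)
```

Because the second tag of each ditag lies in the opposite orientation, it is recovered from the reverse complement; the repeated ditag in the toy read contributes only once, mirroring the rule that repeated ditags are counted a single time.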

Table 1 shows analysis of the first 1,000 tags. Sixteen
percent were eliminated because they either had sequence
ambiguities or were derived from linker sequences. The
remaining 840 tags included 351 tags that occurred once and
77 tags that were found multiple times. Nine of the ten most
abundant tags matched at least one entry in GenBank R87.
The remaining tag was subsequently shown to be derived
from amylase. All ten transcripts were derived from genes of
known pancreatic function and their prevalence was consis-
tent with previous analyses of pancreatic RNA using con-
ventional approaches (Han, et al., Proc. Natl. Acad. Sci.
U.S.A. 83:110, 1986; Takeda, et al., Hum. Mol. Gen.,
2:1793, 1993).

TABLE 1

Pancreatic SAGE Tags

TAG            Gene                                              N      Percent

GAGCACACC      Procarboxypeptidase A1 (X67318)                   64     7.6
TTCTGTGTG      [illegible]                                       [illegible]
GAACACAAA      Chymotrypsinogen ([accession illegible])          37     4.4
TCAGGGTGA      [illegible]                                       31     3.7
GCGTGACCA      Elastase IIIB (M18692)                            20     2.4
GTGTGTGCT      [illegible]                                       16     1.9
CCAGAGAGT      [illegible]                                       14     1.7
TCCTCAAAA      No Match, See Table 2, P1                         14     1.7
[illegible]    Bile Salt Stimulated Lipase ([accession illegible])  12  1.4
GTGTGCGCT      No Match                                          11     1.3
TGCGAGACC      No Match, See Table 2, P2                         9      1.1
GTGAAACCC      21 Alu entries                                    8      1.0
GGTGACTCT      No Match                                          8      1.0
AAGGTAACA      Secretory Trypsin Inhibitor (M11949)              6      0.7
TCCCCTGTG      No Match                                          5      0.6
GTGACCACG      No Match                                          5      0.6
CCTGTAATC      M91159, M29366, 11 Alu entries                    5      0.6
CACGTTGGA      No Match                                          5      0.6
AGCCCTACA      No Match                                          5      0.6
AGCACCTCC      Elongation Factor 2 (Z11692)                      5      0.6
ACGCAGGGA      No Match, See Table 2, P3                         5      0.6
AATTGAAGA      No Match, See Table 2, P4                         5      0.6
TTCTGTGGG      No Match                                          4      0.5
TTCATACAC      No Match                                          4      0.5
GTGGCAGGC      NF-kB (X61499), Alu entry (S94541)                4      0.5
GTAAAACCC      TNF receptor II (M55994), Alu entry (X01448)      4      0.5
GAACACACA      No Match                                          4      0.5
CCTGGGAAG      Pancreatic Mucin (J05582)                         4      0.5
CCCATCGTC      Mitochondrial CytC Oxidase (X15759)               4      0.5
(SEQ ID NO:8-37)

Summary

SAGE tags      Greater than three times                          380    45.2
occurring      Three times (15 x 3 =)                            45     5.4
               Two times (32 x 2 =)                              64     7.6
               One time                                          351    41.8

Total SAGE Tags                                                  840    100.0

"Tag" indicates the 9 bp sequence unique to each tag,
adjacent to the 4 bp anchoring NlaIII site. "N" and "Percent"
indicate the number of times the tag was identified and its
frequency, respectively. "Gene" indicates the accession
number and description of GenBank R87 entries found to
match the indicated tag using the SAGE software group with
the following exceptions. When multiple entries were iden-
tified because of duplicated entries, only one entry is listed.
In the cases of chymotrypsinogen and trypsinogen 1, other
genes were identified that were predicted to contain the same
tags, but subsequent hybridization and sequence analysis
identified the listed genes as the source of the tags. "Alu
entry" indicates a match with a GenBank entry for a tran-
script that contained at least one copy of the Alu consensus
sequence (Deininger, et al., J. Mol. Biol., 151:17, 1981).
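
The "Percent" column and the Summary rows of Table 1 are plain proportions of the 840 retained tags. As a consistency check, the short Python sketch below recomputes them from the counts stated in the table; the dictionary keys and the helper name are illustrative only.

```python
# Tag totals per occurrence class, as stated in the Summary of Table 1.
summary_counts = {
    "greater than three times": 380,
    "three times": 15 * 3,   # 15 distinct tags seen three times each
    "two times": 32 * 2,     # 32 distinct tags seen twice each
    "one time": 351,
}

total_tags = sum(summary_counts.values())


def percent_of_total(n: int, total: int = total_tags) -> float:
    """Tag frequency as a percentage of all retained tags, to one decimal."""
    return round(100.0 * n / total, 1)


summary_percent = {k: percent_of_total(v) for k, v in summary_counts.items()}

print(total_tags)
print(summary_percent)
print(percent_of_total(64))   # procarboxypeptidase A1 tag, N = 64
```

The same helper reproduces the per-tag percentages, for example the 64 occurrences of the procarboxypeptidase A1 tag.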

Example 2 

The quantitative nature of SAGE was evaluated by con-
struction of an oligo-dT primed pancreatic cDNA library
which was screened with cDNA probes for trypsinogen 1/2,
procarboxypeptidase A1, chymotrypsinogen and elastase
IIIB/protease E. Pancreatic mRNA from the same prepara-
tion as used for SAGE in Example 1 was used to construct
a cDNA library in the ZAP Express vector using the ZAP
Express cDNA Synthesis kit following the manufacturer's
protocol (Stratagene). Analysis of 15 randomly selected
clones indicated that 100% contained cDNA inserts. Plates
containing 250 to 500 plaques were hybridized as previously
described (Ruppert, et al., Mol. Cell. Biol. 8:3104, 1988).
cDNA probes for trypsinogen 1, trypsinogen 2, procarbox-
ypeptidase A1, chymotrypsinogen, and elastase IIIB were
derived by RT-PCR from pancreas RNA. The trypsinogen 1
and 2 probes were 93% identical and hybridized to the same
plaques under the conditions used. Likewise, the elastase
IIIB probe and protease E probe were over 95% identical
and hybridized to the same plaques.

The relative abundance of the SAGE tags for these
transcripts was in excellent agreement with the results
obtained with library screening (FIG. 2). Furthermore,
whereas neither trypsinogen 1 and 2 nor elastase IIIB and
protease E could be distinguished by the cDNA probes used
to screen the library, all four transcripts could readily be
distinguished on the basis of their SAGE tags (Table 1).

Example 3

In addition to providing quantitative information on the
abundance of known transcripts, SAGE could be used to
identify novel expressed genes. While for the purposes of
the SAGE analysis in this example, only the 9 bp sequence
unique to each transcript was considered, each SAGE tag
defined a 13 bp sequence composed of the anchoring
enzyme (4 bp) site plus the 9 bp tag. To illustrate this
potential, 13 bp oligonucleotides were used to isolate the
transcripts corresponding to four unassigned tags (P1 to P4),
that is, tags without corresponding entries from GenBank
R87 (Table 1). In each of the four cases, it was possible to
isolate multiple cDNA clones for the tag by simply screen-
ing the pancreatic cDNA library using a 13 bp oligonucleotide
as hybridization probe (examples in FIG. 3).

Plates containing 250 to 2,000 plaques were hybridized to
oligonucleotide probes using the same conditions previously
described for standard probes except that the hybridization
temperature was reduced to room temperature. Washes were
performed in 6×SSC/0.1% SDS for 30 minutes at room
temperature. The probes consisted of 13 bp oligonucleotides
which were labeled with γ-32P-ATP using T4 polynucleotide
kinase. In each case, sequencing of the derived clones
identified the correct SAGE tag at the predicted 3' end of the
identified transcript. The abundance of plaques identified by
hybridization with the 13-mers was in good agreement with
that predicted by SAGE (Table 2). Tags P1 and P2 were
found to correspond to amylase and preprocarboxypeptidase
A2, respectively. No entry for preprocarboxypeptidase A2
and only a truncated entry for amylase was present in
GenBank R87, thus accounting for their unassigned char-
acterization. Tag P3 did not match any genes of known
function in GenBank but did match numerous ESTs, pro-
viding further evidence that it represented a bona fide
transcript. The cDNA identified by P4 showed no significant
homology, suggesting that it represented a previously
uncharacterized pancreatic transcript.

TABLE 2

Characterization of Unassigned SAGE Tags

     TAG                          SAGE        13mer Hyb                 SAGE   Description
                                  Abundance                             Tag

P1   TCCTCAAAA (SEQ ID NO:38)     1.7%        1.5% (6/388)              +      3' end of Pancreatic Amylase (M28443)
P2   TGCGAGACC (SEQ ID NO:39)     1.1%        1.2% (43/3700)            +      3' end of Preprocarboxypeptidase A2 (U19977)
P3   ACGCAGGGA (SEQ ID NO:40)     0.6%        0.2% (5/2772)             +      EST match (R45808)
P4   AATTGAAGA (SEQ ID NO:41)     0.6%        0.4% ([illegible]/1587)   +      no match

"Tag" and "SAGE Abundance" are described in Table 1;
"13mer Hyb" indicates the results obtained by screening a
cDNA library with a 13mer, as described above. The number
of positive plaques divided by the total plaques screened is
indicated in parentheses following the percent abundance. A
positive in the "SAGE Tag" column indicates that the
expected SAGE tag sequence was identified near the 3' end
of isolated clones. "Description" indicates the results of
BLAST searches of the daily updated GenBank entries at
NCBI as of 6/9/95 (Altschul, et al., J. Mol. Biol. 215:403,
1990). A description and Accession number are given for the
most significant matches. P1 was found to match a truncated
entry for amylase, and P2 was found to match an unpub-
lished entry for preprocarboxypeptidase A2 which was
entered after GenBank R87.
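
The "13mer Hyb" percentages in Table 2 are simply positive plaques divided by plaques screened. The Python sketch below recomputes them for P1 to P3 and sets them beside the SAGE abundances; the function and variable names are illustrative, and P4 is left out because its positive-plaque count cannot be read from the table as printed here.

```python
def plaque_abundance(positive: int, screened: int) -> float:
    """Percent of screened plaques that hybridized to the 13-mer probe."""
    return 100.0 * positive / screened


# (positive plaques, total plaques screened), taken from Table 2.
screens = {"P1": (6, 388), "P2": (43, 3700), "P3": (5, 2772)}

# SAGE abundance in percent for the same tags, also from Table 2.
sage_percent = {"P1": 1.7, "P2": 1.1, "P3": 0.6}

hyb_percent = {tag: round(plaque_abundance(*counts), 1)
               for tag, counts in screens.items()}

for tag in screens:
    print(f"{tag}: SAGE {sage_percent[tag]}%  13-mer hybridization {hyb_percent[tag]}%")
```

With only a handful of positive plaques per screen, differences of a few tenths of a percent between the two methods are within what sampling noise alone would produce.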

These results demonstrate that SAGE provides both quan- 
titative and qualitative data about gene expression. The use 
of different anchoring enzymes and/or tagging enzymes with 
various recognition elements lends great flexibility to this 
strategy. In particular, since different anchoring enzymes 
cleave cDNA at different sites, the use of at least 2 different 
AEs on different samples of the same cDNA preparation
allows confirmation of results and analysis of sequences that 
might not contain a recognition site for one of the enzymes. 

As efforts to fully characterize the genome near
completion, SAGE should allow a direct readout of expres-
sion in any given cell type or tissue. In the interim, a major
application of SAGE will be the comparison of gene expres-
sion patterns among tissues and in various developmental
and disease states in a given cell or tissue. One of skill in the
art with the capability to perform PCR and manual sequenc-
ing could perform SAGE for this purpose. Adaptation of this
technique to an automated sequencer would allow the analy-
sis of over 1,000 transcripts in a single 3 hour run. An ABI
377 sequencer can produce a 451 bp readout for 36 tem-
plates in a 3 hour run (451 bp/11 bp per tag × 36 = 1476 tags).
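
The throughput estimate above is a one-line calculation, sketched here in Python. The function name is illustrative. The 11 bp per tag figure is taken from the text; reading it as half of a 22 bp repeat unit (two 9 bp tags plus one 4 bp anchoring site) is an inference, not something the passage states.

```python
def tags_per_run(read_length_bp: int, bp_per_tag: int, templates: int) -> int:
    """Whole tags readable per template, multiplied by templates per run."""
    return (read_length_bp // bp_per_tag) * templates


# Figures quoted for the ABI 377: 451 bp reads, 36 templates, 3 hour run.
estimate = tags_per_run(read_length_bp=451, bp_per_tag=11, templates=36)
print(estimate)  # 41 tags per template x 36 templates
```

Changing the read length or the number of lanes shows directly how sequencer capacity, rather than library preparation, sets the number of transcripts sampled per run.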
The appropriate number of tags to be determined will
depend on the application. For example, the definition of
genes expressed at relatively high levels (0.5% or more) in
one tissue, but low in another, would require only a single
day. Determination of transcripts expressed at greater than
100 mRNA's per cell (0.025% or more) should be quanti-
fiable within a few months by a single investigator. Use of
two different Anchoring Enzymes will ensure that virtually
all transcripts of the desired abundance will be identified.
The genes encoding those tags found to be most interesting
on the basis of their differential representation can be
positively identified by a combination of data-base
searching, hybridization, and sequence analysis as demon-
strated in Table 2. Obviously, SAGE could also be applied
to the analysis of organisms other than humans, and could
direct investigation towards genes expressed in specific
biologic states.

SAGE, as described herein, allows comparison of expres-
sion of numerous genes among tissues or among different
states of development of the same tissue, or between patho-
logic tissue and its normal counterpart. Such analysis is
useful for identifying therapeutically, diagnostically and
prognostically relevant genes, for example. Among the
many utilities for SAGE technology is the identification of
appropriate antisense or triple helix reagents which may be
therapeutically useful. Further, gene therapy candidates can
also be identified by the SAGE technology. Other uses
include diagnostic applications for identification of indi-
vidual genes or groups of genes whose expression is shown
to correlate to predisposition to disease, the presence of
disease, and prognosis of disease, for example. An abun-
dance profile, such as that depicted in Table 1, is useful for
the above described applications. SAGE is also useful for
detection of an organism (e.g., a pathogen) in a host or
detection of infection-specific genes expressed by a patho-
gen in a host.

The ability to identify a large number of expressed genes
in a short period of time, as described by SAGE in the
present invention, provides unlimited uses.

Although the invention has been described with reference
to the presently preferred embodiment, it should be under-
stood that various modifications can be made without
departing from the spirit of the invention. Accordingly, the
invention is limited only by the following claims.



SEQUENCE LISTING

( 1 ) GENERAL INFORMATION:

( i i i ) NUMBER OF SEQUENCES: 7

( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 43 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

TTTTACCAGC TTATTCAATT CGGTCCTCTC GCACAGGGAC ATG

( 2 ) INFORMATION FOR SEQ ID NO:2:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 36 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

ATGGTCGAAT AAGTTAAGCC AGGAGAGCGT GTCCCT

( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 44 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:

TTTTTGTAGA CATTCTAGTA TCTCGTCAAG TCGGAAGGGA CATG

( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 37 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:

AACATCTGTA AGATCATAGA GCAGTTCAGC CTTCCCT

( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 21 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5:

CCAGCTTATT CAATTCGGTC C

( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 21 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:

GTAGACATTC TAGTATCTCG T

( 2 ) INFORMATION FOR SEQ ID NO:7:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 26 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: both
( D ) TOPOLOGY: both

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GACGTCGACC TGAGGTAATT ATAACC

What is claimed is: 

1. An isolated oligonucleotide composition comprising at
least one ditag, wherein the ditag comprises two covalently
joined defined nucleotide sequence tags in opposite
orientation, wherein each tag corresponds to at least one
expressed gene.

2. The composition of claim 1, wherein the oligonucle- 
otide consists of about 1 to 200 ditags. 

3. The composition of claim 2, wherein the oligonucle- 
otide consists of about 8 to 20 ditags. 

4. A method for the detection of gene expression com- 
prising: 

producing complementary deoxyribonucleic acid (cDNA) 
oligonucleotides; 



isolating a first defined nucleotide sequence tag from a
first cDNA oligonucleotide and a second defined nucle-
otide sequence tag from a second cDNA oligonucle-
otide;

linking the first tag to a first oligonucleotide linker thereby 
forming a first linked nucleic acid, wherein the first 
oligonucleotide linker comprises a first enzyme recog- 
nition site that allows DNA cleavage at a site in the first 
defined nucleotide sequence distant from the first rec- 
ognition site; 

linking the second tag to a second oligonucleotide linker
thereby forming a second linked nucleic acid, wherein
the second oligonucleotide linker comprises a second
enzyme recognition site that allows DNA cleavage at a
site in the second defined nucleotide sequence distant
from the second recognition site;

cleaving the first and the second linked nucleic acids with
at least one enzyme that recognizes each of the recog-
nition sites;

ligating the first and second tags to form a ditag; and

determining the nucleotide sequence of at least one tag of
the ditag to detect gene expression.

5. The method of claim 4, wherein the first oligonucle-
otide linker comprises a first amplification primer hybrid-
ization sequence, and the second oligonucleotide linker
comprises a second amplification primer hybridization
sequence; and

further comprising amplifying the ditag oligonucleotide. 

6. The method of claim 4, further comprising producing
concatemers of the ditag.

7. The method of claim 6, wherein the concatemer con- 
sists of about 2 to 200 ditags. 

8. The method of claim 7, wherein the concatemer con- 
sists of about 8 to 20 ditags. 

9. The method of claim 4, wherein the first and second 
oligonucleotide linkers comprise the same nucleotide 
sequence. 

10. The method of claim 4, wherein the first and second 
oligonucleotide linkers comprise different nucleotide 
sequences. 

11. The method of claim 10, wherein the first and second
oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4),
wherein A is dideoxy A.

12. The method of claim 4, wherein at least one of the
enzyme recognition sites is a type IIS endonuclease recog-
nition site.

13. The method of claim 12, wherein the type IIS endo-
nuclease is selected from the group consisting of BsmFI and
FokI.

14. The method of claim 4, wherein the ditag is about 12 
to 60 base pairs. 

15. The method of claim 14, wherein the ditag is about 18 
to 22 base pairs. 

16. The method of claim 5, wherein the amplifying is by
polymerase chain reaction (PCR).

17. The method of claim 16, wherein primers for PCR are
selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

18. A method for detection of gene expression compris-
ing:

cleaving a cDNA sample with a first restriction
endonuclease, wherein the endonuclease cleaves the
cDNA at a defined position in the cDNA thereby
producing defined sequence tags;

isolating the defined cDNA tags and forming a first pool
of tags;

ligating a first pool of tags with a first oligonucleotide
linker having a first enzyme recognition site that allows
DNA cleavage at a site distant from the second
recognition site;

ligating a second pool of tags with a second oligonucle-
otide linker having a second enzyme recognition site
that allows DNA cleavage at a site distant from the
second recognition site;

cleaving the tags with a first and a second tag cleaving 
restriction endonuclease, wherein the first tag-cleaving 
restriction endonuclease recognizes a first enzyme rec- 
ognition site and cleaves at a site distant from the first 
recognition site and wherein the second tag-cleaving 
restriction endonuclease recognizes a second enzyme 
recognition site and cleaves at a site distant from the 
second recognition site; 

ligating the two pools of tags to produce at least one ditag; 
and 

determining the nucleotide sequence of at least one ditag, 
wherein the ditag(s) correspond to sequence from an 
expressed gene. 

19. The method of claim 18, further comprising ampli- 
fying the ditag. 

20. The method of claim 18, wherein the first restriction 
endonuclease has at least one recognition site in the cDNA. 

21. The method of claim 20, wherein the first restriction 
enzyme has a four base pair recognition site. 

22. The method of claim 21, wherein the restriction
endonuclease is NlaIII.

23. The method of claim 18, wherein the cDNA comprises 
a means for capture. 

24. The method of claim 23, wherein the means for 
capture is a binding element. 

25. The method of claim 24, wherein the binding element
is biotin.

26. The method of claim 18, wherein the first and second
oligonucleotide linkers comprise the same nucleotide
sequence.

27. The method of claim 18, wherein the first and second 
oligonucleotide linkers comprise different nucleotide 
sequences. 

28. The method of claim 27, wherein the first and second
oligonucleotide linkers have a sequence:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4),
wherein A is dideoxy A.

29. The method of claim 18, wherein at least one of the
restriction endonuclease sites is a type IIS endonuclease site.

30. The method of claim 29, wherein the type IIS endo-
nuclease is selected from the group consisting of BsmFI and
FokI.

31. The method of claim 18, wherein the ditag is about 12
to 60 base pairs.

32. The method of claim 31, wherein the ditag is about 14 
to 22 base pairs. 

33. The method of claim 18, further comprising ligating 
the ditags to produce a concatemer. 

34. The method of claim 33, wherein the concatemer 
consists of about 2 to 200 ditags. 

35. The method of claim 34, wherein the concatemer 
consists of about 8 to 20 ditags. 

36. The method of claim 18, wherein the amplifying is by
polymerase chain reaction (PCR).

37. The method of claim 36, wherein primers for PCR are
selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

38. A kit useful for detection of gene expression wherein
the presence of a cDNA ditag is indicative of expression of 
a gene having a sequence of a tag of the ditag, the kit 
comprising one or more containers comprising a first con- 
tainer containing a first oligonucleotide linker having a first 
sequence useful hybridization of an amplification primer; a 
second container containing a second oligonucleotide linker 
having a second oligonucleotide linker having a second 
sequence useful hybridization of an amplification primer, 
wherein the linkers further comprise a restriction endonu- 
clease site for cleavage of DNA at a site distant from the 
restriction endonuclease recognition site; and a third and 
fourth container having a nucleic acid primers for hybrid- 
ization to the first and second unique sequences of the linker. 

39. The kit of claim 38, wherein the linkers have a
sequence

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3' (SEQ ID NO:1)
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5' (SEQ ID NO:2)

or

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3' (SEQ ID NO:3)
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5' (SEQ ID NO:4),
wherein A is dideoxy A.

40. The kit of claim 38, wherein the restriction endonu-
clease is a type IIS endonuclease.

41. The kit of claim 40, wherein the type IIS endonuclease
is BsmFI.

42. The kit of claim 38, wherein the primers for ampli-
fication are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5)
and

5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6).

43. The method of claim 18, wherein the first oligonucle-
otide linker comprises a first amplification primer hybrid-
ization sequence and the second oligonucleotide linker com-
prises a second amplification primer hybridization sequence.



